AITopics | sequence identity

Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain-specific filters to construct high-quality training data. These filters can independently refine a PLM's output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable and functional enzymes, while expanding exploration into protein sequence space beyond natural variants. Although our approach is agnostic to both the choice of PLM and the protein system, we demonstrate its effectiveness with a genome-scale PLM (GenSLM) applied to the tryptophan synthase enzyme family. The supervised fine-tuned model generates sequences that are not only more novel but also display improved characteristics across both targeted design constraints and emergent protein property measures.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2512.09329

Country:

Europe > France (0.05)
North America > United States > California (0.05)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Decoding the dark proteome: Deep learning-enabled discovery of druggable enzymes in Wuchereria bancrofti

Shivakumar, Shawnak; Hernandez, Jefferson

arXiv.org Artificial Intelligence, Oct-10-2025

Wuchereria bancrofti, the parasitic roundworm responsible for lymphatic filariasis, permanently disables over 36 million people and places 657 million at risk across 39 countries. A major bottleneck for drug discovery is the lack of functional annotation for more than 90 percent of the W. bancrofti dark proteome, leaving many potential targets unidentified. In this work, we present a novel computational pipeline that converts W. bancrofti's unannotated amino acid sequence data into precise four-level Enzyme Commission (EC) numbers and drug candidates. We utilized a DEtection TRansformer to estimate the probability of enzymatic function, fine-tuned a hierarchical nearest neighbor EC predictor on 4,476 labeled parasite proteins, and applied rejection sampling to retain only four-level EC classifications at 100 percent confidence. This pipeline assigned precise EC numbers to 14,772 previously uncharacterized proteins and discovered 543 EC classes not previously known in W. bancrofti. A qualitative triage emphasizing parasite-specific targets, chemical tractability, biochemical importance, and biological plausibility prioritized six enzymes across five separate strategies: anti-Wolbachia cell-wall inhibition, proteolysis blockade, transmission disruption, purinergic immune interference, and cGMP-signaling destabilization. We curated a 43-compound library from ChEMBL and BindingDB and co-folded across multiple protein conformers with Boltz-2. All six targets exhibited at least moderately strong predicted binding affinities below 1 micromolar, with moenomycin analogs against peptidoglycan glycosyltransferase and NTPase inhibitors showing promising nanomolar hits and well-defined binding pockets. While experimental validation remains essential, our results provide the first large-scale functional map of the W. bancrofti dark proteome and accelerate early-stage drug development for the species.

artificial intelligence, enzyme, machine learning, (19 more...)

2510.07337

Country:

North America > United States > Texas > Harris County > Houston (0.04)
North America > United States > California (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

636d57c09a5baacd83722639265802f6-Supplemental-Conference.pdf

Neural Information Processing Systems, Oct-8-2025, 19:22:24 GMT

database, sequence, sequence identity, (16 more...)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.76)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.37)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.71)
Information Technology > Artificial Intelligence > Natural Language (0.69)

Evaluating Protein Transfer Learning with TAPE

Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, Yun Song

Neural Information Processing Systems, Oct-2-2025, 13:08:07 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, protein, (16 more...)

Country:

Europe > France (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report (0.93)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

We thank the reviewers for their positive feedback and helpful suggestions for improvement

Neural Information Processing Systems, Oct-2-2025, 13:07:51 GMT

We will present these results separately in Table 1 of the paper.

artificial intelligence, machine learning, natural language, (17 more...)

Industry: Health & Medicine (0.31)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.43)
Information Technology > Artificial Intelligence > Natural Language (0.32)

f51338d736f95dd42427296047067694-Supplemental.pdf

Neural Information Processing Systems, Aug-18-2025, 21:46:57 GMT

artificial intelligence, machine learning, sequence, (16 more...)

Country: Asia > Middle East > Lebanon > Keserwan-Jbeil Governorate > Blat (0.06)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation

Wong, Brian Shing-Hei; Kim, Joshua Mincheol; Fung, Sin-Hang; Xiong, Qing; Ao, Kelvin Fu-Kiu; Wei, Junkang; Wang, Ran; Wang, Dan Michelle; Zhou, Jingying; Feng, Bo; Cheng, Alfred Sze-Lok; Yip, Kevin Y.; Tsui, Stephen Kwok-Wing; Cao, Qin

arXiv.org Artificial Intelligence, Aug-18-2025

Allergens, typically proteins capable of triggering adverse immune responses, represent a significant public health challenge. To accurately identify allergen proteins, we introduce Applm (Allergen Prediction with Protein Language Models), a computational framework that leverages the 100-billion parameter xTrimoPGLM protein language model. We show that Applm consistently outperforms seven state-of-the-art methods in a diverse set of tasks that closely resemble difficult real-world scenarios. These include identifying novel allergens that lack similar examples in the training set, differentiating between allergens and non-allergens among homologs with high sequence similarity, and assessing functional consequences of mutations that create few changes to the protein sequences. Our analysis confirms that xTrimoPGLM, originally trained on one trillion tokens to capture general protein sequence characteristics, is crucial for Applm's performance by detecting important differences among protein sequences. In addition to providing Applm as open-source software, we also provide our carefully curated benchmark datasets to facilitate future research.

artificial intelligence, machine learning, natural language, (22 more...)

2508.10541

Country: